4 research outputs found

Hardware-based task dependency resolution for the StarSs programming model

Authors: Dallou, Tamer; Juurlink, Ben
Publication date: 01/01/2012

Recently, several programming models have been proposed that try to ease parallel programming. One of these is StarSs. In StarSs, the programmer identifies pieces of code that can be executed as tasks, along with their inputs and outputs. The runtime system (RTS) then determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS can become a bottleneck that limits the scalability of the system, and proposed a hardware task management system called Nexus to eliminate this bottleneck. Nexus has several limitations, however. For example, the number of inputs and outputs of each task is limited to a fixed constant, and Nexus does not support double buffering. In this paper we present Nexus++, which addresses these as well as other limitations. Experimental results show that double buffering achieves a speedup of 54× with and 143× without modeling memory contention, and that Nexus++ significantly enhances the scalability of applications parallelized using StarSs.

EC/FP7/248647/EU/ENabling technologies for a programmable many-CORE/ENCOR
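The dependency-resolution step described in this abstract can be illustrated with a small sketch. It is not the StarSs RTS or the Nexus++ design: the function name `resolve_dependencies`, the `(name, inputs, outputs)` tuple layout, and the choice to enforce read-after-write, write-after-write, and write-after-read orderings without renaming are all assumptions made for illustration.

```python
from collections import defaultdict

def resolve_dependencies(tasks):
    """Derive task dependencies from declared inputs and outputs.

    tasks: list of (name, inputs, outputs) tuples in program order
           (hypothetical layout, chosen for this sketch).
    Returns a dict mapping each task name to the set of earlier
    tasks it must wait for.
    """
    last_writer = {}                   # address -> last task that wrote it
    readers = defaultdict(list)        # address -> tasks that read it since that write
    deps = {}
    for name, inputs, outputs in tasks:
        waits = set()
        for addr in inputs:            # read-after-write
            if addr in last_writer:
                waits.add(last_writer[addr])
        for addr in outputs:
            if addr in last_writer:    # write-after-write
                waits.add(last_writer[addr])
            waits.update(readers[addr])  # write-after-read
        # Record this task's accesses only after its own dependencies are known.
        for addr in inputs:
            readers[addr].append(name)
        for addr in outputs:
            last_writer[addr] = name
            readers[addr] = []
        deps[name] = waits
    return deps

example = [
    ("t1", [], ["a"]),
    ("t2", ["a"], ["b"]),
    ("t3", ["a"], ["c"]),
    ("t4", ["b", "c"], ["a"]),
]
deps = resolve_dependencies(example)
```

In this example `t2` and `t3` both wait only for `t1` and can run in parallel, while `t4` waits for all three earlier tasks because it reads their results and overwrites `a`.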

DepositOnce

Crossref

Nexus#: a distributed hardware task manager for task-based programming models

Authors: Dallou, Tamer; Elhossini, Ahmed; Engelhardt, Nina; Juurlink, Ben
Publication date: 01/01/2015

In the era of multicore systems, the number of cores that can be integrated on a single chip is expected to reach three digits. The key to utilizing such huge computational power is to extract very fine-grained parallelism from the user program. This is non-trivial for the average programmer, and becomes very hard as the number of potentially parallel instances increases. Task-based programming models such as OmpSs are promising, since they handle the detection of dependencies and synchronization for the programmer. However, state-of-the-art research shows that task management is not cheap and introduces a significant overhead that limits the scalability of OmpSs. Nexus# is a hardware accelerator for the OmpSs runtime system that dynamically monitors dependencies between tasks. It is fully synthesizable in VHDL and uses a distributed task graph model to achieve the best scalability. Supporting tasks with an arbitrary number of parameters and any dependency pattern, Nexus# achieves better performance than Nanos, the official OmpSs runtime system, and scales well for the H264dec benchmark with very fine-grained tasks, among other benchmarks from the Starbench suite.

DepositOnce

Crossref

An Integrated Hardware-Software Approach to Task Graph Management

Authors: Dallou, Tamer; Elhossini, Ahmed; Engelhardt, Nina; Juurlink, Ben
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2014

Task-based parallel programming models with explicit data dependencies, such as OmpSs, are gaining popularity because they make it easy to describe parallel algorithms with complex and irregular dependency patterns. These advantages, however, come at a steep cost: the runtime overhead incurred by dynamic dependency resolution. Hardware support for task management has been proposed in previous work as a possible solution. We present VSs, a runtime library for the OmpSs programming model that integrates the Nexus++ hardware task manager, and evaluate the performance of the VSs-Nexus++ system. Experimental results show that applications with fine-grained tasks can achieve speedups of up to 3.4×, while applications optimized for current runtimes attain 1.3×. Providing support for hardware task managers in runtime libraries is therefore a viable approach to improving the performance of OmpSs applications.

DepositOnce

Crossref

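Verbesserung der Skalierbarkeit von Mehrkernsystemen hinsichtlich der Verwendung von feingranularer Parallelität in task-basierten Programmiermodellen [Improving the scalability of multicore systems with regard to the use of fine-grained parallelism in task-based programming models]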

Author: Dallou, Tamer
Publication date: 01/01/2017

In the past few years, it has become foreseeable that Moore's law is coming to an end. This law, based on the observation that the number of transistors in an integrated chip doubles every 18-24 months, served as a roadmap for the semiconductor industry. With the law on the verge of its end due to the huge increase in the power density of integrated chips, a new era in computing systems has begun. In this era, the performance of a single core is no longer the most important parameter; the performance of the whole multicore system is. It is an era in which multiplicity and heterogeneity of computing units have become the norm in state-of-the-art systems on chips (SoCs), not the exception. New programming models have emerged that try to bridge the gap between programming complexity and efficient utilization of multicore systems and their various resources. One promising approach is the dataflow task-based programming model, as in StarSs/OmpSs and OpenMP, where an application is broken down into smaller execution units called tasks, which are dynamically scheduled to run on the available resources according to the data dependences between them. However, this approach has an overhead: its runtime system needs a considerable amount of computational power to track dependences between tasks and build the task graph, decide which tasks are ready, schedule them on idle cores, and, upon task completion, kick off dependent tasks, all performed dynamically at runtime. Although dataflow task-based programming offers a solution to the programmability problem of multicore systems, its runtime overhead has, in practice, limited the scalability of applications written in this model, especially when the tasks are fine-grained and/or the task graphs are complex.
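The runtime duties listed above (tracking pending dependences, deciding which tasks are ready, scheduling them on idle cores, and releasing dependent tasks on completion) can be sketched in a few lines. This is a simplified illustration, not the thesis's runtime or the Nexus hardware: the function `run_task_graph`, the `deps` dictionary layout, and the assumption that every task takes exactly one time step are hypothetical.

```python
from collections import deque

def run_task_graph(deps, num_cores):
    """Schedule a task graph in lock-step rounds.

    deps: dict mapping each task to the set of tasks it waits for
          (hypothetical layout; every task must appear as a key).
    Returns a list of rounds; each round lists the tasks that run
    together on up to num_cores cores.
    Simplifying assumption: every task takes exactly one time step.
    """
    pending = {task: len(waits) for task, waits in deps.items()}
    dependents = {task: [] for task in deps}
    for task, waits in deps.items():
        for producer in waits:
            dependents[producer].append(task)

    ready = deque(task for task in deps if pending[task] == 0)
    schedule = []
    while ready:
        # Dispatch as many ready tasks as there are idle cores.
        running = [ready.popleft() for _ in range(min(num_cores, len(ready)))]
        schedule.append(running)
        # On completion, decrement dependence counters and release new tasks.
        for done in running:
            for task in dependents[done]:
                pending[task] -= 1
                if pending[task] == 0:
                    ready.append(task)
    return schedule

graph = {"t1": set(), "t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}}
two_cores = run_task_graph(graph, num_cores=2)
one_core = run_task_graph(graph, num_cores=1)
```

With two cores the four tasks finish in three rounds, with one core in four. All of this bookkeeping runs in software on the same cores as the application, which is the overhead the abstract refers to.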
The main contribution of this thesis is to offload the heaviest part of the runtime system, task graph management, to a dedicated hardware accelerator, in order to accelerate the runtime system and to reduce contention between the runtime system and user applications for shared resources (microprocessor cores and the memory system). The first prototype is presented in the form of the Nexus++ co-processor for the StarSs/OmpSs programming models. Using traces of applications written in StarSs/OmpSs from the StarBench benchmark suite, Nexus++ significantly improves the scalability of those applications. Nexus#, the successor of Nexus++, has an improved execution pipeline and additionally parallelizes task graph management itself in a distributed fashion. For example, for an application with coarse-grained independent tasks such as "ray tracing", where a software-only parallel solution achieves a 31x speedup over serial execution, Nexus# and Nexus++ achieve speedups of 194x and 60x respectively. For fine-grained tasks with complex inter-task dependencies, as in H.264 video decoding, the software-only parallel solution is slower than serial execution due to runtime overhead, whereas Nexus# and Nexus++ achieve speedups of 7x and 2.2x respectively. Nexus# outperforms Nexus++ in terms of scalability, which opens the door to supporting even finer-grained tasks. As described in this work, its extensive reconfigurability makes Nexus# a suitable hardware accelerator for various multicore systems, ranging from embedded to complex high-performance systems.

For several years it has been foreseeable that Moore's law, after almost 50 years, is coming to an end. This "law" was based on the observation that the number of components on a chip doubled roughly every 18 months, and it laid the foundation for the roadmap of the semiconductor industry.
Due to the enormous increase in the power density of microprocessors, this law is about to end, and a new era in the computer industry is dawning. In this era, single-thread performance is no longer as important a factor as the performance of the entire multicore system. It is an era in which the diversity and multiplicity of processors is the norm in current systems on chips (SoCs), not the exception. Since then, new programming models have emerged that try to bridge the gap between algorithmic complexity and efficient use of all the resources of the SoC hardware. One promising approach is the dataflow task-parallel model, as in StarSs/OmpSs and OpenMP. In it, the application program is broken down into small units of work called tasks. These tasks are then dynamically distributed to suitable and available resources, taking data dependencies into account, and processed in parallel. In doing so, different runtime systems and variants of this programming model have to be weighed against each other. The following points must be considered in the selection, since they require a considerable amount of computing power: the internal management of the task graph, the tracking of data dependencies between tasks, the decision as to when a task is ready for execution, the distribution to or allocation of processors, and finally, after a task has completed, the notification to the runtime system that triggers the release of further tasks. For dataflow task-parallel programming models, which in many cases are at least theoretically a sensible way to program multi- and many-core systems, the overhead in the runtime system is in practice often an insurmountable hurdle. This applies in particular to very finely divided or very complex dataflow or task graphs.
The main contribution of this dissertation is to offload the largest part of a runtime system, task graph management, to a dedicated hardware accelerator. The aim is to speed up the runtime system and at the same time to prevent some of the conflicts that arise when the runtime system and user applications use shared resources (microprocessor cores and memory systems). As a first prototype, hardware support for the StarSs/OmpSs programming model is introduced in the form of the Nexus++ co-processor, and its influence on scalability and system performance is measured. For traces of applications from the StarBench benchmark suite, Nexus++ showed considerably improved scalability of these programs written in StarSs/OmpSs. Nexus# is a revised version of Nexus++. In addition to parallelizing the task graph management processes, which is done in a distributed fashion, this version has an improved execution pipeline compared to Nexus++. This makes it possible, for example, to accelerate an application with coarse-grained independent tasks, such as "ray tracing", by factors of 194x and 60x with Nexus# and Nexus++ respectively, compared to serial execution. With a pure software solution, by contrast, parallelization would only achieve a speedup of 31x. Applications such as H.264 video decoding, which have fine-grained tasks that additionally depend on one another, also benefit. A purely software-based parallelization would be slower than serial execution due to the runtime overhead. In contrast, Nexus# and Nexus++ achieve speedups of 7x and 2.2x respectively. Nexus# outperforms Nexus++ in terms of scalability, which enables a finer-grained division of tasks.
As shown in this work, Nexus# is, owing to its extensive reconfigurability, a contemporary hardware accelerator. It is suitable for a wide range of multicore systems, from embedded systems to complex high-performance systems.

DepositOnce